Word Co-occurrence Matrix and Context Dependent Class in LSA based Language Model for Speech Recognition

Authors

  • Welly Naptali
  • Masatoshi Tsuchiya
  • Seiichi Nakagawa
Abstract

Data sparseness is a common problem in language models (LMs). It is caused by insufficient training data, which in turn leaves infrequent words with unreliable probability estimates. Mapping words into classes gives infrequent words more reliable probabilities, because they can rely on other, more frequent words in the same class. In this research, we investigate a class LM based on latent semantic analysis (LSA). A word-document matrix is commonly used to represent a collection of text (a corpus) in the LSA framework. This matrix records how many times a word occurs in a given document; in other words, it ignores the word order in the sentence. We propose several word co-occurrence matrices that keep the word order. By applying LSA to these matrices, words in the vocabulary are projected into a continuous vector space according to their position in the sentences. To support these matrices, we also define a context dependent class (CDC) LM. Unlike a traditional class LM, the CDC LM distinguishes classes according to their context in the sentences. Experiments on the Wall Street Journal (WSJ) corpus show that the word co-occurrence matrix works 3.62%-12.72% better than the word-document matrix. Furthermore, the CDC improves performance and achieves better perplexity than the traditional class LM based on LSA. When the model is linearly interpolated with the word-based trigram, it gives relative perplexity improvements of about 2.01% for the trigram model and 9.47% for the four-gram model over a standard word-based trigram LM.

Keywords: Latent semantic analysis, Language model, Word co-occurrence matrix, Context dependent class
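The idea of an order-preserving co-occurrence matrix followed by LSA can be illustrated with a minimal sketch. The abstract does not specify how the proposed matrices are built, so the code below assumes one simple variant (rows are current words, columns are the immediately preceding word) and uses a plain truncated SVD from NumPy as the LSA step. The function names `cooccurrence_matrix` and `lsa_project` and the toy sentences are hypothetical, chosen for illustration only.

```python
import numpy as np


def cooccurrence_matrix(sentences, vocab):
    """Count how often each word (row) directly follows each word (column).

    Assumption: this left-neighbour layout is only one possible
    order-preserving matrix; it is not taken from the paper.
    """
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sentence in sentences:
        for prev, cur in zip(sentence, sentence[1:]):
            counts[index[cur], index[prev]] += 1.0
    return counts


def lsa_project(matrix, k):
    """Project each row (word) into a k-dimensional space via truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


# Toy corpus (hypothetical data, not from the WSJ experiments).
sentences = [
    ["the", "market", "rose"],
    ["the", "market", "fell"],
    ["stocks", "rose"],
]
vocab = sorted({word for sentence in sentences for word in sentence})

counts = cooccurrence_matrix(sentences, vocab)
vectors = lsa_project(counts, k=2)  # one 2-dimensional vector per word
```

In a class LM, vectors like these would then be clustered so that each word is assigned to a class; the clustering step and the CDC model itself are not shown here because the abstract gives no details about them.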

Similar resources

Context Dependent Class Language Model based on Word Co-occurrence Matrix in LSA Framework for Speech Recognition

We address the data sparseness problem in language models (LMs). Using a class LM is one way to avoid this problem. In a class LM, infrequent words are supported by more frequent words in the same class. This paper investigates a class LM based on LSA. A word-document matrix is usually used to represent a corpus in the LSA framework. However, this matrix ignores word order in the sentence. We pr...

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...

The Context Dependent Sentence Abstraction model

The Context Dependent Sentence Abstraction (CDSA) model and Latent Semantic Analysis (LSA) were compared in their ability to predict sentence similarity. Evidence supports the conclusion that the CDSA model better predicts human ratings for short phrases and sentences than does LSA. Alternative theoretical reasons are given for this finding. Introduction Researchers in many disciplines within c...

Computing Semantic Representations: A Comparative Analysis

How can we formally capture the complex semantic relationships of the human lexicon? This question has been the focus of many recent computational studies. The ability to represent semantics faithfully in formal mechanisms not only is important for understanding the nature of the lexical system of natural languages, but also has significant implications for understanding the mental representati...

Voice-based Age and Gender Recognition using Training Generative Sparse Model

Abstract: Gender recognition and age detection are important problems in telephone speech processing to investigate the identity of an individual using voice characteristics. In this paper a new gender and age recognition system is introduced based on generative incoherent models learned using sparse non-negative matrix factorization and atom correction post-processing method. Similar to genera...

Publication date: 2009